METHODS FOR MOLECULAR TOXICOLOGY MODELING
INVENTORS: James C. DIGGANS and Michael ELASHOFF 

RELATED APPLICATIONS 

[0001] This application claims the benefit of U.S. Provisional Application Ser. No. 
60/554,981, filed March 22, 2004 and U.S. Provisional Application Ser. No. 60/613,831, 
filed September 29, 2004, both of which are herein incorporated by reference in their entirety 
for all purposes. This application also claims priority to PCT Application No. 
PCT/US03/37556, filed November 24, 2003, which is herein incorporated by reference in its 
entirety for all purposes. 

SEQUENCE LISTING SUBMISSION ON COMPACT DISC 

[0002] The Sequence Listing submitted concurrently herewith on compact disc under 37 
C.F.R. §§ 1.821(c) and 1.821(e) is herein incorporated by reference in its entirety. Four copies 
of the Sequence Listing, one on each of four compact discs are provided. Copy 1, Copy 2 
and Copy 3 are identical. Copies 1, 2 and 3 are also identical to the CRF. Each electronic 
copy of the Sequence Listing was created on November 22, 2004 with a file size of 2398 KB. 
The file names are as follows: Copy 1- gene logic 5133-wo.txt; Copy 2- gene logic 5133- 
wo.txt; Copy 3- gene logic 5133-wo.txt; CRF- gene logic 5133-wo.txt. 

BACKGROUND OF THE INVENTION 

[0003] The need for methods of assessing the toxic impact of a compound, pharmaceutical 
agent or environmental pollutant on a cell or living organism has led to the development of 
procedures which utilize living organisms as biological monitors. The simplest and most 
convenient of these systems utilize unicellular microorganisms such as yeast and bacteria, 
since they are the most easily maintained and manipulated. In addition, unicellular screening 
systems often use easily detectable changes in phenotype to monitor the effect of test 
compounds on the cell. Unicellular organisms, however, are inadequate models for 
estimating the potential effects of many compounds on complex multicellular animals, as 
they do not have the ability to carry out biotransformations. 

[0004] The biotransformation of chemical compounds by multicellular organisms is a 
significant factor in determining the overall toxicity of agents to which they are exposed. 
Accordingly, multicellular screening systems may be preferred or required to detect the toxic
effects of compounds. The use of multicellular organisms as toxicology screening tools has 
been significantly hampered, however, by the lack of convenient screening mechanisms or 
endpoints, such as those available in yeast or bacterial systems. Additionally, certain 
previous attempts to produce toxicology prediction systems have failed to provide the 
necessary modeling data and statistical information to accurately predict toxic responses (e.g., 
WO 00/12760, WO 00/47761, WO 00/63435, WO 01/32928, and WO 01/38579). 
[0005] The pharmaceutical industry spends significant resources to ensure that therapeutic 
compounds of interest are not toxic to human beings. This process is lengthy as well as 
expensive and involves testing in a series of organisms starting with rats and progressing to 
dogs or non-human primates. Moreover, modeling methods for designing candidate
pharmaceuticals and their synthesis in nucleic acid, peptide or organic compound libraries
have increased the need for inexpensive, fast and accurate methods to predict toxic responses.
Toxicity modeling methods based on nucleic acid hybridization platforms would allow the
use of biological samples from compound-exposed animals or cell cultures, such as rats
or rat hepatocyte cell cultures, to detect human organ toxicity much earlier than has been
possible to date.

SUMMARY OF THE INVENTION 

[0006] The present invention is based, in part, on the elucidation of the global changes in 
gene expression in animal tissues or cells, such as liver or kidney tissue or cells, exposed to 
known toxins, in particular hepatotoxins or renal toxins, as compared to unexposed tissues or 
cells, as well as the identification of individual genes that are differentially expressed upon 
toxin exposure. 

[0007] In various aspects, the invention includes methods of predicting at least one toxic
effect of a test agent by comparing gene expression information from agent-exposed samples
to a database of gene expression information from toxin-exposed and control samples 
(vehicle-exposed samples or samples exposed to a non-toxic compound or low levels of a 
toxic compound). These methods comprise providing or generating quantitative gene 
expression information from the samples, converting the gene expression information to 
matrices of fold-change values by a robust multi-array average (RMA) algorithm, generating 
a gene regulation score for each gene that is differentially expressed upon exposure to the test
agent by a partial least squares (PLS) algorithm, and calculating a sample prediction score for 
the test agent. This sample prediction score is then compared to a reference prediction score 
for one or more toxicity models. If the sample prediction score is equal to or greater than the 
reference prediction score, the test agent can be predicted to have at least one toxic effect or 
to produce at least one pathology corresponding to the toxicity model to which the test 
agent's prediction score is compared. 

[0008] In various aspects, the invention includes methods of creating a toxicology model. 
These methods comprise providing or generating quantitative nucleic acid hybridization data 
for a plurality of genes from at least one cell or tissue sample exposed to a toxin and at least 
one cell or tissue sample exposed to the toxin vehicle, converting the hybridization data from 
at least one gene to a gene expression measure, such as fold-change value, by a robust multi- 
array average (RMA) algorithm, generating a gene regulation score from a gene expression 
measure for at least one gene by a partial least squares (PLS) algorithm, and generating a 
toxicity reference prediction score for the toxin, thereby creating a toxicology model. 
[0009] In other aspects, the invention includes a computer system comprising a computer 
readable medium containing a toxicity model for predicting the toxicity of a test agent and 
software that allows a user to predict at least one toxic effect of a test agent by comparing a 
sample prediction score for the test agent to a toxicity reference prediction score for the 
toxicity model. 

[0010] In further aspects of the invention, the gene expression information from test agent- 
exposed tissues or cells may be prepared as text or binary files, such as CEL files, and 
transmitted via the Internet for analysis and comparisons to the toxicity models stored on a 
remote, central server. After processing, the user that sent the text files receives a report 
indicating the toxicity or non-toxicity of the test agent. 

[0011] In other aspects of the invention, the user may download one or more toxicity models 
from the remote, central server, as well as software for manipulating the user's data and the 
toxicity models, to a local server. Gene expression information from test agent-exposed 
tissues or cells may then be prepared as text files, such as CEL files, and analyzed and 
compared at the user's site to the toxicity models stored on the local server. After processing, 
the software generates a report indicating the toxicity or non-toxicity of the test agent. 

TABLES

[0012] Table 1: Table 1 provides the GLGC identifier (fragment names from Table 2) in 
relation to the SEQ ID NO. and GenBank Accession number for each of the gene fragments 
listed in Table 2 (all of which are herein incorporated by reference and replication in the 
attached sequence listing). The gene names and Unigene cluster titles are also included. 
[0013] Table 2: Table 2 presents the PLS scores (weighted gene index scores) from an 
exemplary kidney general toxicity model. 

DETAILED DESCRIPTION 

Definitions 

[0014] As used herein, "nucleic acid hybridization data" refers to any data derived from the 
hybridization of a sample of nucleic acids to one or more of a series of reference nucleic
acids. Such reference nucleic acids may be in the form of probes on a microarray or set of
beads or may be in the form of primers that are used in polymerization reactions, such as
PCR amplification, to detect hybridization of the primers to the sample nucleic acids.
Nucleic acid hybridization data may be in the form of numerical representations of the
hybridization and may be derived from quantitative, semi-quantitative or non-quantitative
analysis techniques or technology platforms. Nucleic acid hybridization data includes, but is
not limited to, gene expression data. The data may be in any form, including fluorescence data
or measurements of fluorescence probe intensities from a microarray or other hybridization
technology platform. The nucleic acid hybridization data may be raw data or may be 
normalized to correct for, or take into account, background or raw noise values, including 
background generated by microarray high/low intensity spots, scratches, high regional or 
overall background and raw noise generated by scanner electrical noise and sample quality 
fluctuation. 

[0015] As used herein, "cell or tissue samples" refers to one or more samples comprising cell 
or tissue from an animal or other organism, including laboratory animals such as rats or mice. 
The cell or tissue sample may comprise a mixed population of cells or tissues or may be 
substantially a single cell or tissue type, such as hepatocytes or liver tissue. Cell or tissue 
samples as used herein may also be in vitro grown cells or tissue, such as primary cell 
cultures, immortalized cell cultures, cultured hepatocytes, cultured liver tissue, etc. Cells or
tissue may be derived from any organ, including but not limited to, liver, kidney, cardiac,
muscle (skeletal or cardiac) or brain. 

[0016] As used herein, "test agent" refers to an agent, compound or composition that is being 
tested or analyzed in a method of the invention. For instance, a test agent may be a 
pharmaceutical candidate for which toxicology data is desired. 

[0017] As used herein, "test agent vehicle" refers to the diluent or carrier in which the test
agent is dissolved or suspended, or in which it is administered to an animal, organism or cells.
[0018] As used herein, "toxin vehicle" refers to the diluent or carrier in which a toxin is
dissolved or suspended, or in which it is administered to an animal, organism or cells.
[0019] As used herein, a "gene expression measure" refers to any numerical representation of 
the expression level of a gene or gene fragment in a cell or tissue sample. A "gene 
expression measure" includes, but is not limited to, a fold-change value. 
[0020] As used herein, "at least one gene" refers to a nucleic acid molecule detected by the 
methods of the invention in a sample. The term "gene" as used herein, includes fully 
characterized open reading frames and the encoded mRNA as well as fragments of expressed 
RNA that are detectable by any hybridization method in the cell or tissue samples assayed as 
described herein. For instance, a "gene" includes any species of nucleic acid that is 
detectable by hybridization to a probe in a microarray, such as the "genes" of Table 1. As
used herein, at least one gene includes a "plurality of genes." 

[0021] As used herein, "fold-change value" refers to a numerical representation of the 
expression level of a gene, genes or gene fragments between experimental paradigms, such as 
a test or treated cell or tissue sample, compared to any standard or control. For instance, a 
fold-change value may be presented as microarray-derived fluorescence or probe intensities
for a gene or genes from a test cell or tissue sample compared to a control, such as an 
unexposed cell or tissue sample or a vehicle-exposed cell or tissue sample. An RMA fold- 
change value as described herein is a non-limiting example of a fold-change value calculated 
by methods of the invention. 
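
As an illustration of the definition above, the following minimal sketch computes a per-gene fold-change value on a log2 scale. It assumes that log2-scale expression values have already been produced (for instance by RMA) and that the fold-change is taken as the treated value minus the mean of the vehicle-exposed values; the function name, gene identifiers and numbers are hypothetical and are not taken from the specification.

```python
from statistics import mean


def log2_fold_change(treated_log2, vehicle_log2):
    """Return per-gene log2 fold-change: treated value minus mean vehicle value.

    treated_log2: dict mapping gene -> log2 expression in the treated sample.
    vehicle_log2: dict mapping gene -> list of log2 expression values from
                  vehicle-exposed control samples (assumed layout).
    """
    fold_changes = {}
    for gene, treated_value in treated_log2.items():
        # A difference on the log2 scale corresponds to a ratio on the raw scale.
        fold_changes[gene] = treated_value - mean(vehicle_log2[gene])
    return fold_changes


# Hypothetical values for two gene fragments.
treated = {"gene_a": 9.0, "gene_b": 6.5}
vehicle = {"gene_a": [8.0, 8.0], "gene_b": [7.0, 8.0]}
fold_changes = log2_fold_change(treated, vehicle)
```

Under this convention a value of zero corresponds to a fold-change of 1, that is, no change relative to the control.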

[0022] As used herein, "gene regulation score" refers to a quantitative measure of gene 
expression for a gene or gene fragment as derived from a weighted index score or PLS score 
for each gene and the fold-change value from treated vs. control samples. 
[0023] As used herein, "sample prediction score" refers to a numerical score produced via 
methods of the invention as herein described. For instance, a "sample prediction score" may 
be calculated using the PLS weight or PLS score for at least one gene in a gene expression
profile generated from the sample and the RMA fold-change value for that same gene. A 
"sample prediction score" is derived from summing the individual gene regulation scores 
calculated for a given sample. 

[0024] As used herein, "toxicity reference prediction score" refers to a numerical score 
generated from a toxicity model that can be used as a cut-off score to predict at least one 
toxic effect of a test agent. For instance, a sample prediction score can be compared to a 
toxicity reference prediction score to determine if the sample score is above or below the 
toxicity reference prediction score. Sample prediction scores falling below the value of a 
toxicity reference prediction score are scored as not exhibiting at least one toxic effect and 
sample prediction scores above the value of a toxicity reference prediction score are scored as
exhibiting at least one toxic effect. 
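
The cut-off comparison described above can be sketched as follows. This is only an illustrative sketch: the function name is hypothetical, and the treatment of a score exactly equal to the cut-off follows the "equal to or greater than" wording of the Summary, which is assumed here.

```python
def predicts_toxic_effect(sample_prediction_score, reference_prediction_score):
    """Return True when the sample score meets or exceeds the reference cut-off.

    Scores below the cut-off are scored as not exhibiting a toxic effect.
    Treating equality as a positive call is an assumption based on the Summary.
    """
    return sample_prediction_score >= reference_prediction_score
```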

[0025] As used herein, a log scale linear additive model includes any log-linear model such as
log scale robust multi-array average or RMA (Irizarry et al., Nucleic Acids Research 31(4):
e15 (2003)).

[0026] As used herein, "remote connection" refers to a connection to a server by a means 
other than a direct hard-wired connection. This term includes, but is not limited to, 
connection to a server through a dial-up line, broadband connection, Wi-Fi connection, or 
through the Internet. 

[0027] As used herein, a "CEL file" refers to a file that contains the average probe intensities 
associated with a coordinate position, cell or feature on a microarray (such information 
provided by the CDF or 1LQ file). See Affymetrix GeneChip® Expression Analysis 
Technical Manual, which is herein 

[0028] As used herein, a "gene expression profile" comprises any quantitative representation 
of the expression of at least one mRNA species in a cell sample or population and includes 
profiles made by various methods such as differential display, PCR, microarray and other 
hybridization analysis, etc. 

Methods of Generating Toxicity Models 
[0029] To evaluate and identify gene expression changes that are predictive of toxicity, 
studies using selected compounds with well characterized toxicity may be used to build a 
model or database of the present invention. Methods of the present invention include an 
RMA/PLS method (analysis of raw gene expression data by the robust multi-array average
algorithm, with evaluation of predictive ability by the partial least squares algorithm) to 
create models and databases for predicting toxicity. 

[0030] In general, cell and tissue samples are analyzed after exposure to compounds known 
to exhibit at least one toxic effect. Low doses of these compounds, or the vehicles in which 
they were prepared, are used as negative controls. Compounds that are known not to exhibit 
at least one toxic effect may also be used as negative controls. 

[0031] In the present invention, a toxicity study or "tox study" comprises a set of cell or
tissue samples that have been exposed to one or more toxins and may include matched 
samples exposed to the toxin vehicle or a low, non-toxic, dose of the toxin. As described 
below, the cell or tissue samples may be exposed to the toxin and control treatments in vivo 
or in vitro. In some studies, toxin and control exposure to the cell or tissue samples may take 
place by administering an appropriate dose to an animal model, such as a laboratory rat. In 
some studies, toxin and control exposure to the cell or tissue samples may take place by 
administering an appropriate dose to a sample of in vitro grown cells or tissue, such as 
primary rat or human hepatocytes. These samples are typically organized into cohorts by test 
compound, time (for instance, time from initial test compound dosage to time at which rats 
are sacrificed), and dose (amount of test compound administered). All cohorts in a tox study 
typically share the same vehicle control. For example, a cohort may be a set of samples from 
rats that were treated with acyclovir for 6 hours at a high dosage (100 mg/kg). A time- 
matched vehicle cohort is a set of samples that serve as controls for treated animals within a 
tox study, e.g., for 6-hour acyclovir-treated high dose samples the time-matched vehicle 
cohort would be the 6-hour vehicle-treated samples within that study.
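
One way to picture the cohort organization described above is the following sketch, which groups samples by compound, time and dose and looks up the time-matched vehicle cohort. The record layout, the use of the string "vehicle" to mark control samples, and all identifiers are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    sample_id: str
    compound: str    # "vehicle" marks a vehicle-treated control (assumed convention)
    time_hours: int  # time from initial dosage to sacrifice
    dose: str        # e.g. "high", "low", or "none" for vehicle


def group_into_cohorts(samples):
    """Group sample IDs into cohorts keyed by (compound, time, dose)."""
    cohorts = defaultdict(list)
    for sample in samples:
        key = (sample.compound, sample.time_hours, sample.dose)
        cohorts[key].append(sample.sample_id)
    return dict(cohorts)


def time_matched_vehicle(samples, time_hours):
    """Return IDs of vehicle-treated samples collected at the given time point."""
    return [
        sample.sample_id
        for sample in samples
        if sample.compound == "vehicle" and sample.time_hours == time_hours
    ]
```

For the 6-hour high-dose acyclovir cohort in the example above, `time_matched_vehicle(samples, 6)` would return the 6-hour vehicle-treated samples of the same study.
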
[0032] A toxicity database or "tox database" is a set of tox studies that alone or in 
combination comprise a reference database. For instance, a reference database may include 
data from rat tissue and cell samples from rats that were treated with different test compounds
at different dosages and exposed to the test compounds for varying lengths of time. 
[0033] RMA, or robust multi-array average, is an algorithm that converts raw fluorescence 
intensities, such as those derived from hybridization of sample nucleic acids to an Affymetrix 
GeneChip® microarray, into expression values, one value for each gene fragment on a chip 
(Irizarry et al. (2003), Nucleic Acids Res. 31(4):e15, 8 pp.; and Irizarry et al. (2003)
"Exploration, normalization, and summaries of high density oligonucleotide array probe level
data," Biostatistics 4(2): 249-264). RMA produces values on a log2 scale, typically between
4 and 12, for genes that are expressed significantly above or below control levels. These 
RMA values can be positive or negative and are centered around zero for a fold-change of 
about 1. A matrix of gene expression values generated by RMA can be subjected to PLS to 
produce a model for prediction of toxic responses, e.g., a model for predicting liver or kidney 
toxicity. In a preferred embodiment, the model is validated by techniques known to those 
skilled in the art. Preferably, a cross-validation technique is used. In such a technique, the 
data is randomly broken into training and test sets several times until model success rate is 
determined. Most preferably, such technique uses 2/3 / 1/3 cross-validation, where 1/3 of the 
data is dropped and the other 2/3 is used to rebuild the model. 
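
The 2/3 / 1/3 cross-validation described above can be sketched as repeated random splits, with the model rebuilt on each training set and scored on the held-out third. This is a minimal sketch under stated assumptions: the number of repeats, the seed, and the `fit` and `predict` callables are hypothetical placeholders, since the specification does not fix them.

```python
import random


def two_thirds_splits(sample_ids, repeats, seed=0):
    """Return `repeats` random (training, held_out) splits of roughly 2/3 and 1/3."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits


def success_rate(splits, fit, predict, labels):
    """Fraction of held-out samples whose predicted label matches the known label.

    fit(training_ids) -> model and predict(model, sample_id) -> label are
    caller-supplied (hypothetical) functions; labels maps sample_id -> label.
    """
    correct = 0
    total = 0
    for training_ids, held_out_ids in splits:
        model = fit(training_ids)  # rebuild the model on the retained 2/3
        for sample_id in held_out_ids:
            correct += int(predict(model, sample_id) == labels[sample_id])
            total += 1
    return correct / total
```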

[0034] PLS, or Partial Least Squares, is a modeling algorithm that takes as inputs a matrix of
predictors and a vector of supervised scores to generate a set of prediction weights for each of
the input predictors (Nguyen et al. (2002), Bioinformatics 18:39-50). These prediction
weights are then used to calculate a gene regulation score to indicate the ability of each 
analyzed gene to predict a toxic response. As described in the examples, the gene regulation 
scores may then be used to calculate a toxicity reference prediction score. 
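
To make the inputs and outputs of PLS concrete, the sketch below computes the weight vector of the first PLS component for a single response: each weight is proportional to the covariance between one centered predictor column (one gene) and the centered supervised scores. This is a simplified one-component illustration, not the procedure used to produce the weights of Table 2; the number of components and any preprocessing are not stated in the text and are assumptions here.

```python
from math import sqrt


def first_pls_weights(predictors, scores):
    """Return the unit-length first-component PLS weight for each predictor column.

    predictors: list of rows, one row per sample and one column per gene.
    scores: list of supervised scores, one per sample (e.g. 1 = toxic, 0 = not).
    """
    n_samples = len(predictors)
    n_genes = len(predictors[0])
    column_means = [
        sum(row[j] for row in predictors) / n_samples for j in range(n_genes)
    ]
    score_mean = sum(scores) / n_samples
    # Covariance-like term between each centered gene column and the scores.
    raw = [
        sum(
            (predictors[i][j] - column_means[j]) * (scores[i] - score_mean)
            for i in range(n_samples)
        )
        for j in range(n_genes)
    ]
    norm = sqrt(sum(value * value for value in raw))
    if norm == 0.0:
        raise ValueError("predictors carry no information about the scores")
    return [value / norm for value in raw]
```

In this toy setting a gene whose fold-change tracks the supervised score receives a large weight, while a gene that does not vary with the score receives a weight of zero.
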
[0035] From the nucleic acid hybridization data, a gene expression measure is calculated for 
one or more genes whose level of expression is detected in the nucleic acid hybridization 
value. As described above, the gene expression measure may comprise an RMA fold-change 
value. The toxicity reference prediction score = $\sum_i w_i R_{FC_i}$, where "i" is the index number for each gene in a
gene expression profile to be evaluated, "$w_i$" is the PLS weight (or PLS score, see Table 2)
for the i-th gene, and "$R_{FC_i}$" is the RMA fold-change value for the i-th gene, as determined from a
normalized RMA matrix of gene expression data from the sample (described above). The
PLS weight multiplied by the RMA fold-change value gives a gene regulation score for each
gene, and the regulation scores for all the individual genes are added to give a toxicity
reference prediction score for a sample or cohort of samples. A toxicity reference prediction
score can be calculated from at least one gene regulation score, or at least about 5, 10, 25, 50, 
100, 500 or about 1,000 or more gene regulation scores. 

[0036] In one embodiment of the invention, a toxicology or toxicity model of the invention is 
prepared or created by the steps of (a) providing nucleic acid hybridization data for a plurality 
of genes from at least one cell or tissue sample exposed to a toxin and at least one cell or 
tissue sample exposed to the toxin vehicle; (b) converting the hybridization data from at least 
one gene to a gene expression measure; (c) generating a gene regulation score from the gene
expression measure for said at least one gene; and (d) generating a toxicity reference
prediction score for the toxin, thereby creating a toxicology model. The gene expression 
measure may be a gene fold-change value calculated by a log scale linear additive model 
such as RMA and the toxicity reference prediction score may be generated with PLS. The 
toxicity reference prediction score may then be added to a toxicity model or database and be 
used to predict at least one toxic effect of an unknown test agent or compound. 
[0037] In another preferred embodiment, the model is validated by techniques known to 
those skilled in the art. Preferably, a cross-validation technique is used. In such a technique, 
the data is randomly broken into training and test sets several times until an acceptable model 
success rate is determined. Most preferably, such technique uses 2/3 / 1/3 cross-validation, 
where 1/3 of the data is dropped and the other 2/3 is used to rebuild the model. 

Methods of Predicting Toxic Effects 

[0038] The gene regulation scores and toxicity prediction scores derived from cell or tissue 
samples exposed to toxins may be used to predict at least one toxic effect, including the 
hepatotoxicity, renal toxicity or other tissue toxicity of a test or unknown agent or compound. 
The gene regulation scores and toxicity prediction scores from cell or tissue samples exposed 
to toxins may also be used to predict the ability of a test agent or compound to induce a tissue 
pathology, such as liver necrosis, in a sample. The toxicology prediction methods of the 
invention are limited only by the availability of the appropriate toxicology model and 
toxicology prediction scores. For instance, the prediction methods of a given system, such as 
a computer system or database of the invention, can be expanded simply by running new 
toxicology studies and models of the invention using additional toxins or specific tissue 
pathology inducing agents and the appropriate cell or tissue samples. 
[0039] As used herein, at least one toxic effect includes, but is not limited to, a detrimental
change in the physiological status of a cell or organism. The response may be, but is not 
required to be, associated with a particular pathology, such as tissue necrosis. Accordingly, 
the toxic effect includes effects at the molecular and cellular level. Hepatotoxicity, for 
instance, is an effect as used herein and includes but is not limited to the pathologies of: 
cholestasis, genotoxicity/carcinogenesis, hepatitis, human-specific toxicity, induction of liver 
enlargement, steatosis, macrovesicular steatosis, microvesicular steatosis, necrosis, non-
genotoxic/non-carcinogenic toxicity, peroxisome proliferation, rat non-genotoxic toxicity,
and general hepatotoxicity. 

[0040] In general, assays to predict the toxicity of a test agent (or compound or multi- 
component composition) comprise the steps of exposing a cell or tissue sample or population 
of cell or tissue samples to the test agent or compound, providing nucleic acid hybridization 
data for at least one gene from the test agent exposed cell or tissue sample(s), by, for instance, 
assaying or measuring the level of relative or absolute gene expression of one or more of the 
genes, such as one or more of the genes in Table 2, calculating a sample prediction score and 
comparing the sample prediction score to one or more toxicology reference scores (see 
Example 1). 

[0041] Sample prediction scores may be calculated as follows: sample prediction score =
$\sum_i w_i R_{FC_i}$, where "i" is the index number for each gene in a gene expression profile to be evaluated,
"$w_i$" is the PLS weight (or PLS score) for each gene derived from a toxicity model, and "$R_{FC_i}$" is
the RMA fold-change value for the i-th gene, as determined from a normalized RMA matrix of
gene expression data from the sample (described above). The PLS weight from a given 
model multiplied by the RMA fold-change value gives a gene regulation score for each gene, 
and the regulation scores for all the individual genes are added to give a prediction score for 
the sample. 
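
The calculation above amounts to a weighted sum, which can be sketched as follows. The dictionaries, gene identifiers and numerical values are hypothetical, and skipping model genes that are absent from the sample is an assumption made for illustration rather than a rule stated in the text.

```python
def gene_regulation_scores(pls_weights, rma_fold_changes):
    """Multiply each gene's PLS weight by its RMA fold-change value.

    Genes in the model that were not measured in the sample are skipped
    (an assumption made for this sketch).
    """
    return {
        gene: weight * rma_fold_changes[gene]
        for gene, weight in pls_weights.items()
        if gene in rma_fold_changes
    }


def sample_prediction_score(pls_weights, rma_fold_changes):
    """Sum the individual gene regulation scores for one sample."""
    return sum(gene_regulation_scores(pls_weights, rma_fold_changes).values())


# Hypothetical three-gene model and sample.
weights = {"g1": 0.5, "g2": -0.25, "g3": 1.0}
fold_changes = {"g1": 2.0, "g2": 4.0, "g3": 0.5}
score = sample_prediction_score(weights, fold_changes)
```

Here the gene regulation scores are 1.0, -1.0 and 0.5, so the sample prediction score is their sum.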

[0042] Nucleic acid hybridization data may include any measurement of the hybridization, 
including gene expression levels, of sample nucleic acids to probes corresponding to about 
(or at least) 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 50, 75, 100, 200, 500, 1000 or more genes, 
or ranges of these numbers, such as about 2-10, about 10-20, about 20-50, about 50-100, 
about 100-200, about 200-500 or about 500-1000 genes. Nucleic acid hybridization data for 
toxicity prediction may also include the measurement of nearly all the genes in a toxicity 
model. "Nearly all" the genes may be considered to mean at least 80% of the genes in any 
one toxicity model.
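
The "nearly all" criterion can be checked with a simple coverage calculation, sketched below. The function name and the representation of genes as identifiers in a set are hypothetical; only the 80% threshold comes from the text.

```python
def covers_nearly_all(model_genes, measured_genes, threshold=0.80):
    """Return True when at least `threshold` of the model's genes were measured."""
    model = set(model_genes)
    if not model:
        raise ValueError("toxicity model contains no genes")
    coverage = len(model & set(measured_genes)) / len(model)
    return coverage >= threshold
```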

[0043] The methods of the invention to predict at least one toxic effect of a test agent or 
compound may be practiced by one individual or at one location, or may be practiced by 
more than one individual or at more than one location. For instance, methods of the 
invention include steps wherein the exposure of a test agent or compound to a cell or tissue 
sample(s) is accomplished in one location, nucleic acid processing and the generation of 
nucleic acid hybridization data takes place at another location and gene regulation and sample
prediction scores are calculated or generated at another location.

[0044] In another embodiment of the invention, cell or tissue samples are exposed to a test 
agent or compound by administering the agent to laboratory rats and nucleic acids are 
processed from selected tissues and hybridized to a microarray to produce nucleic acid 
hybridization data. The nucleic acid hybridization data is then sent to a remote server 
comprising a toxicology reference database and software that enables generation of 
individual gene regulation scores and one or more sample prediction scores from the nucleic 
acid hybridization data. The software may also enable a user to pre-select specific toxicology 
models and to compare the generated sample prediction scores to one or more toxicology 
reference scores contained within a database of such scores. The user may then generate or 
order an appropriate output product(s) that presents or represents the results of the data 
analysis, generation of gene regulation scores, sample prediction scores and/or comparisons 
to one or more toxicology reference scores. 

[0045] Data, including nucleic acid hybridization data, may be transmitted to a server via any 
means available, including a secure direct dial-up or a secure or unsecured Internet 
connection. Toxicology prediction reports or any result of the methods herein may also be 
transmitted via these same mechanisms. For instance, a first user may transmit nucleic acid 
hybridization data to a remote server via a secure password protected Internet link and then 
request transmission of a toxicology report from the server via that same Internet link. 
[0046] Data transmitted by a remote user of a toxicity database or model may be raw, un- 
normalized data or may be normalized from various background parameters before 
transmission. For instance, data from a microarray may be normalized for various chip and 
background parameters such as those described above, before transmission. The data may be 
in any form, as long as the data can be recognized and properly formatted by available 
software or the software provided as part of a database or computer system. For instance, 
microarray data may be provided and transmitted in a .cel file or any other common data files
produced from the analysis of microarray based hybridization on commercially available
technology platforms (see, for instance, the Affymetrix GeneChip® Expression Analysis
Technical Manual available at www.affymetrix.com). Such files may or may not be
annotated with various information, for instance, but not limited to, information related to the 
customer or remote user, cell or tissue sample data or information, hybridization technology
or platform on which the data was generated and/or test agent data or information. 
[0047] Once data is received, the nucleic acid hybridization data may be screened for 
database compatibility by any available means. In one embodiment, commonly available 
data quality control metrics can be applied. For instance, outlier analysis methods or 
techniques may be utilized to identify samples incompatible with the database, for instance, 
samples exhibiting erroneous fluorescence values from control probes which are common
between the data and the database or toxicity model. In addition, various data QC metrics 
can be applied, including one or more disclosed in PCT/US03/24160, filed August 1, 2003, 
which claims priority to U.S. provisional application 60/399,727. 
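
By way of illustration only, an outlier screen on control-probe values might look like the sketch below. The specific rule used here, flagging samples whose control-probe value lies more than a chosen number of median absolute deviations from the database median, is an assumption and is not a metric disclosed in the text or in the cited application; all names and the default cut-off are hypothetical.

```python
from statistics import median


def flag_outlier_samples(incoming_control_values, database_control_values,
                         max_deviations=3.0):
    """Return IDs of incoming samples whose control-probe value is an outlier.

    incoming_control_values: dict mapping sample_id -> control-probe value.
    database_control_values: list of the same control-probe value across the
                             reference database (assumed to be available).
    """
    center = median(database_control_values)
    # Median absolute deviation: a spread estimate that resists extreme values.
    mad = median([abs(value - center) for value in database_control_values])
    limit = max_deviations * mad
    return sorted(
        sample_id
        for sample_id, value in incoming_control_values.items()
        if abs(value - center) > limit
    )
```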

Cell or Tissue Sample Preparation 

[0048] As described above, the cell population that is exposed to the test agent, compound or 
composition may be exposed in vitro or in vivo. For instance, cultured or freshly isolated 
liver cells, in particular rat hepatocytes, may be exposed to the agent under standard 
laboratory and cell culture conditions. In another assay format, in vivo exposure may be 
accomplished by administration of the agent to a living animal, for instance a laboratory rat. 
[0049] Procedures for designing and conducting toxicity tests in in vitro and in vivo systems 
are well known, and are described in many texts on the subject, such as Loomis et al.,
Loomis's Essentials of Toxicology, 4th Ed., Academic Press, New York, 1996; Ecobichon,
The Basics of Toxicity Testing, CRC Press, Boca Raton, 1992; Frazier, editor, In Vitro 
Toxicity Testing, Marcel Dekker, New York, 1992; and the like. 

[0050] In in vivo toxicity testing, two groups of test organisms are usually employed. One
group serves as a control, and the other group receives the test compound in a single dose (for 
acute toxicity tests) or a regimen of doses (for prolonged or chronic toxicity tests). Because, 
in some cases, the extraction of tissue as called for in the methods of the invention requires 
sacrificing the test animal, both the control group and the group receiving compound must be 
large enough to permit removal of animals for sampling tissues, if it is desired to observe the 
dynamics of gene expression through the duration of an experiment. 
[0051] In setting up a toxicity study, extensive guidance is provided in the literature for 
selecting the appropriate test organism for the compound being tested, route of 
administration, dose ranges, and the like. Water or physiological saline (0.9% NaCl in water) 
is the solvent of choice for the test compound since these solvents permit administration by a 
variety of routes. When this is not possible because of solubility limitations, vegetable oils 
such as corn oil or organic solvents such as propylene glycol may be used. 
[0052] Regardless of the route of administration, the volume required to administer a given 
dose is limited by the size of the animal that is used. It is desirable to keep the volume of 
each dose uniform within and between groups of animals. When rats or mice are used, the 
volume administered by the oral route generally should not exceed about 0.005 ml per gram 
of animal. Even when aqueous or physiological saline solutions are used for parenteral 
injection the volumes that are tolerated are limited, although such solutions are ordinarily 
thought of as being innocuous. The intravenous LD50 of distilled water in the mouse is 
approximately 0.044 ml per gram and that of isotonic saline is 0.068 ml per gram of mouse. 
In some instances, the route of administration to the test animal should be the same as, or as 
similar as possible to, the route of administration of the compound to man for therapeutic 
purposes. 
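
As a minimal illustration of the volume limit described above, the following Python sketch computes the largest oral dose volume for an animal of a given weight and checks a planned dose against it. Only the 0.005 ml per gram figure comes from the text; the function names, the example dose and the formulation concentration are hypothetical.

```python
# Oral guideline quoted in the text for rats and mice (ml per gram of animal).
ORAL_MAX_ML_PER_G = 0.005


def max_oral_volume_ml(body_weight_g):
    """Upper bound on a single oral dose volume for an animal of this weight."""
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return ORAL_MAX_ML_PER_G * body_weight_g


def dosing_volume_ml(body_weight_g, dose_mg_per_kg, conc_mg_per_ml):
    """Volume needed to deliver dose_mg_per_kg at the given concentration.

    Raises ValueError when the volume would exceed the oral guideline, which
    signals that a more concentrated formulation is needed.
    """
    dose_mg = dose_mg_per_kg * body_weight_g / 1000.0
    volume = dose_mg / conc_mg_per_ml
    if volume > max_oral_volume_ml(body_weight_g):
        raise ValueError("volume exceeds the oral guideline for this animal")
    return volume


# Hypothetical example: a 250 g rat dosed at 10 mg/kg from a 5 mg/ml solution.
print(max_oral_volume_ml(250))
print(dosing_volume_ml(250, 10, 5))
```

Keeping the concentration fixed within a study and varying only the volume by body weight is one way to keep dose volumes uniform within and between groups.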

[0053] When a compound is to be administered by inhalation, special techniques for 
generating test atmospheres are necessary. The methods usually involve aerosolization or 
nebulization of fluids containing the compound. If the agent to be tested is a fluid that has an 
appreciable vapor pressure, it may be administered by passing air through the solution under 
controlled temperature conditions. Under these conditions, dose is estimated from the 
volume of air inhaled per unit time, the temperature of the solution, and the vapor pressure of 
the agent involved. Gases are metered from reservoirs. When particles of a solution are to be 
administered, unless the particle size is less than about 2 µm the particles will not reach the 
terminal alveolar sacs in the lungs. A variety of apparatus and chambers are available for 
studies that detect irritant effects or other toxic endpoints of agents 
administered by inhalation. The preferred method of administering an agent to animals is via 
the oral route, either by intubation or by incorporating the agent in the feed. 
[0054] When the agent is exposed to cells in vitro or in cell culture, the cell population to be 
exposed to the agent may be divided into two or more subpopulations, for instance, by 
dividing the population into two or more identical aliquots. In some preferred embodiments 
of the methods of the invention, the cells to be exposed to the agent are derived from liver 
tissue. For instance, cultured or freshly isolated rat hepatocytes may be used. 

[0055] The methods of the invention may be used generally to predict at least one toxic 
response, and, as described in the Examples, may be used to predict the likelihood that a 
compound or test agent will induce various specific pathologies, such as liver cholestasis, 
genotoxicity/carcinogenesis, hepatitis, human-specific toxicity, induction of liver 
enlargement, steatosis, macrovesicular steatosis, microvesicular steatosis, necrosis, non- 
genotoxic/non-carcinogenic toxicity, peroxisome proliferation, rat non-genotoxic toxicity, 
general hepatotoxicity, or other pathologies associated with at least one known toxin. The 
methods of the invention may also be used to determine the similarity of a toxic response to 
one or more individual compounds. In addition, the methods of the invention may be used to 
predict or elucidate the potential cellular pathways influenced, induced or modulated by the 
compound or test agent. 

Databases and Computer Systems 

[0056] Databases and computer systems of the present invention typically comprise one or 
more data structures comprising toxicity or toxicology models as described herein, including 
models comprising individual gene or toxicology marker weighted index scores or PLS 
scores (See Table 2), gene regulation scores, sample prediction scores and/or toxicity 
reference prediction scores. Such databases and computer systems may also comprise 
software that allows a user to manipulate the database content or to calculate or generate 
scores as described herein, including individual gene regulation scores and sample prediction 
scores from nucleic acid hybridization data. Software may also allow a user to predict, assay 
for or screen for at least one toxic response, including toxicity, hepatotoxicity, renal toxicity, 
etc., to include gene or protein pathway information and/or to include information related to 
the mechanism of toxicity, including possible cellular and molecular mechanisms. As an 
example, software may include at least one element from the Gene Logic ToxShield™ 
Predictive Modeling System such as software comprising at least one algorithm to convert 
hybridization data from varying platforms, for instance from one microarray platform to a 
second microarray platform (see U.S. Provisional Application 60/613,831, filed September 
29, 2004, which is herein incorporated by reference in its entirety for all purposes). 
[0057] As discussed above, the databases and computer systems of the invention may 
comprise equipment and software that allow access directly or through a remote link, such as 
direct dial-up access or access via a password protected Internet link. 

[0058] Any available hardware may be used to create computer systems of the invention. 
Any appropriate computer platform, user interface, etc. may be used to perform the necessary 
comparisons between sequence information, gene or toxicology marker information and any 
other information in the database or information provided as an input. For example, a large 
number of computer workstations are available from a variety of manufacturers. 
Client/server environments, database servers and networks are also widely available and 
appropriate platforms for the databases of the invention. 

[0059] The databases may be designed to include different parts, for instance a sequence 
database and a toxicology reference database. Methods for the configuration and 
construction of such databases and computer-readable media containing such databases are 
widely available, for instance, see U.S. Publication No. 2003/0171876 (Serial No. 
10/090,144), filed March 5, 2002, PCT Publication No. WO 02/095659, published November 
23, 2002, and U.S. Patent No. 5,953,727, which are herein incorporated by reference in their 
entirety. In a preferred embodiment, the database is a ToxExpress® or BioExpress® database 
marketed by Gene Logic Inc., Gaithersburg, MD. 

[0060] The databases of the invention may be linked to an outside or external database such 
as GenBank (www.ncbi.nlm.nih.gov/entrez.index.html); KEGG (www.genome.ad.jp/kegg); 
SPAD (www.grt.kyushu-u.ac.jp/spad/index.html); HUGO (www.gene.ucl.ac.uk/hugo); 
Swiss-Prot (www.expasy.ch.sprot); Prosite (www.expasy.ch/tools/scnpsit1.html); OMIM 
(www.ncbi.nlm.nih.gov/omim); and GDB (www.gdb.org). In a preferred embodiment, the 
external database is GenBank and the associated databases maintained by the National Center 
for Biotechnology Information (NCBI) (www.ncbi.nlm.nih.gov). 

Toxicity or Toxicology Reports 

[0061] As described above, the methods, databases and computer systems of the invention 
can be used to produce, deliver and/or send a toxicity or toxicology report. Consistent 
with the use of the terms "toxicity" and "toxicology" herein, a "toxicity report" and a 
"toxicology report" are interchangeable. 

[0062] The toxicity report of the invention typically comprises information or data related to 
the results of the practice of a method of the invention. For instance, the practice of a method 
of identifying at least one toxic effect of a test agent or compound as herein described may 
result in the preparation or production of a report describing the results of the method 
including an indication or prediction of at least one toxic response, such as toxicity, 
hepatotoxicity, renal toxicity, etc. The report may comprise information related to the toxic 
effects predicted by the comparison of at least one sample prediction score to at least one 
toxicity reference prediction score from the database as well as other related information such 
as a literature review or citation list and/or information regarding potential toxicity 
mechanism(s) of action, etc. The report may also present information concerning the nucleic 
acid hybridization data, such as the integrity of the data as well as information input by the 
user of the database and methods of the invention, such as information used to annotate the 
nucleic acid hybridization data. 

[0063] As a non-limiting example, a toxicity report of the invention may be in a 
form such as the reports disclosed in PCT/US02/22701, filed July 18, 2002, and U.S. 
Provisional Application 60/613,831, filed September 29, 2004, both of which are herein 
incorporated by reference in their entirety for all purposes. As described elsewhere in this 
specification, the report may be generated by a server or computer system to which a user 
has loaded nucleic acid hybridization data. The report related to that nucleic acid data may be 
generated and delivered to the user via remote means such as a password secured 
environment available over the Internet or via available computer communication means such 
as email. 

Generating Nucleic Acid Hybridization Data 

[0064] Any assay format to detect gene expression may be used to produce nucleic acid 
hybridization data. For example, traditional Northern blotting, dot or slot blot, nuclease 
protection, primer directed amplification, RT-PCR, semi- or quantitative PCR, 
branched-chain DNA and differential display methods may be used for detecting gene expression levels 
or producing nucleic acid hybridization data. Those methods are useful for some 
embodiments of the invention. In cases where smaller numbers of genes are detected, 
amplification based assays may be most efficient. Methods and assays of the invention, 
however, may be most efficiently designed with high-throughput hybridization-based 
methods for detecting the expression of a large number of genes. 

[0065] To produce nucleic acid hybridization data, any hybridization assay format may be 
used, including solution-based and solid support-based assay formats. Solid supports 
containing oligonucleotide probes for differentially expressed genes of the invention can be 
filters, polyvinyl chloride dishes, particles, beads, microparticles or silicon or glass based 
chips, etc. Such chips, wafers and hybridization methods are widely available, for example, 
those disclosed by Beattie (WO 95/11755). 

[0066] Any solid surface to which oligonucleotides can be bound, either directly or 
indirectly, either covalently or non-covalently, can be used. A preferred solid support is a 
high density array or DNA chip. These contain a particular oligonucleotide probe in a 
predetermined location on the array. Each predetermined location may contain more than 
one molecule of the probe, but each molecule within the predetermined location has an 
identical sequence. Such predetermined locations are termed features. There may be, for 
example, from 2, 10, 100, 1000 to 10,000, 100,000 or 400,000 or more of such features on a 
single solid support. The solid support, or the area within which the probes are attached may 
be on the order of about a square centimeter. Probes corresponding to the genes of Tables 1-2 
or from the related applications described above may be attached to single or multiple solid 
support structures, e.g., the probes may be attached to a single chip or to multiple chips to 
comprise a chip set. 

[0067] Oligonucleotide probe arrays, including bead assays or collections of beads, for 
expression monitoring can be made and used according to any techniques known in the art 
(see, for example, Lockhart et al. (1996), Nat Biotechnol 14:1675-1680; McGall et al. (1996), 
Proc Natl Acad Sci USA 93:13555-13560). Such probe arrays may contain two or 
more oligonucleotides that are complementary to or hybridize to two or more of the genes 
described in Table 2. For instance, such arrays may contain oligonucleotides that are 
complementary to or hybridize to at least about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 70, 100, 
500 or 1,000 or more of the genes described herein. 

[0068] The sequences of the toxicity expression marker genes of Table 2 are in the public 
databases. Table 1 provides the SEQ ID NO: and GenBank Accession Number (NCBI 
RefSeq ID) for each of the sequences (see www.ncbi.nlm.nih.gov/), as well as the title of the 
cluster of which the gene is part. The sequences of the genes in GenBank are expressly herein 
incorporated by reference in their entirety as of the filing date of this application, as are 
related sequences, for instance, sequences from the same gene of different lengths, variant 
sequences, polymorphic sequences, genomic sequences of the genes and related sequences 
from different species, including the human counterparts, where appropriate. 

[0069] The terms "background" or "background signal intensity" refer to hybridization 
signals resulting from non-specific binding, or other interactions, between the labeled target 
nucleic acids and components of the oligonucleotide array (e.g., the oligonucleotide probes, 
control probes, the array substrate, etc.). Background signals may also be produced by 
intrinsic fluorescence of the array components themselves. A single background signal can 
be calculated for the entire array, or a different background signal may be calculated for each 
target nucleic acid. In a preferred embodiment, background is calculated as the average 
hybridization signal intensity for the lowest 5% to 10% of the probes in the array, or, where a 
different background signal is calculated for each target gene, for the lowest 5% to 10% of 
the probes for each gene. Of course, one of skill in the art will appreciate that where the 
probes to a particular gene hybridize well and thus appear to be specifically binding to a 
target sequence, they should not be used in a background signal calculation. Alternatively, 
background may be calculated as the average hybridization signal intensity produced by 
hybridization to probes that are not complementary to any sequence found in the sample (e.g. 
probes directed to nucleic acids of the opposite sense or to genes not found in the sample 
such as bacterial genes where the sample is mammalian nucleic acids). Background can also 
be calculated as the average signal intensity produced by regions of the array that lack any 
probes at all. 
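
The first background calculation described above (the average signal of the lowest 5% to 10% of probes, either array-wide or per gene) can be sketched as follows. This is a minimal sketch; the function names and the input layout are hypothetical. As the text notes, probes that are clearly binding their target specifically should be removed by the caller before the calculation.

```python
def background_lowest_fraction(intensities, fraction=0.05):
    """Mean intensity of the lowest `fraction` of probes (at least one probe)."""
    if not intensities:
        raise ValueError("no probe intensities supplied")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(intensities)
    n = max(1, int(len(ordered) * fraction))
    lowest = ordered[:n]
    return sum(lowest) / n


def background_per_gene(probes_by_gene, fraction=0.05):
    """Separate background for each gene, computed from that gene's own probes."""
    return {
        gene: background_lowest_fraction(values, fraction)
        for gene, values in probes_by_gene.items()
    }


# Hypothetical array of 100 probe intensities.
array_intensities = list(range(1, 101))
print(background_lowest_fraction(array_intensities, 0.05))
print(background_lowest_fraction(array_intensities, 0.10))
```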

[0070] The phrase "hybridizing specifically to" or "specifically hybridizes" refers to the 
binding, duplexing, or hybridizing of a molecule substantially to or only to a particular 
nucleotide sequence or sequences under stringent conditions when that sequence is present in 
a complex mixture (e.g., total cellular) DNA or RNA. 

[0071] As used herein a "probe" is defined as a nucleic acid, capable of binding to a target 
nucleic acid of complementary sequence through one or more types of chemical bonds, 
usually through complementary base pairing, usually through hydrogen bond formation. As 
used herein, a probe may include natural (i.e., A, G, U, C, or T) or modified bases 
(7-deazaguanosine, inosine, etc.). In addition, the bases in probes may be joined by a linkage 
other than a phosphodiester bond, so long as it does not interfere with hybridization. Thus, 
probes may be peptide nucleic acids in which the constituent bases are joined by peptide 
bonds rather than phosphodiester linkages. 

Nucleic Acid Samples 

[0072] Cell or tissue samples may be exposed to the test agent in vitro or in vivo. When 
cultured cells or tissues are used, appropriate mammalian cell extracts, such as liver extracts, 
may also be added with the test agent to evaluate agents that may require biotransformation 
to exhibit toxicity. In a preferred format, primary isolates or cultured cell lines of animal or 
human renal cells may be used. 

[0073] The genes which are assayed according to the present invention are typically in the 
form of mRNA or reverse transcribed mRNA. The genes may or may not be cloned. The 
genes may or may not be amplified. The cloning and/or amplification do not appear to bias 
the representation of genes within a population. In some assays, it may be preferable, 
however, to use polyA+ RNA as a source, as it can be used with fewer processing steps. 
[0074] As is apparent to one of ordinary skill in the art, nucleic acid samples used in the 
methods and assays of the invention may be prepared by any available method or process. 
Methods of isolating total mRNA are well known to those of skill in the art. For example, 
methods of isolation and purification of nucleic acids are described in detail in Chapter 3 of 
Laboratory Techniques in Biochemistry and Molecular Biology, Vol. 24: Hybridization With 
Nucleic Acid Probes: Theory and Nucleic Acid Probes, P. Tijssen, Ed., Elsevier Press, New 
York, 1993. Such samples include RNA samples, but also include cDNA synthesized from a 
mRNA sample isolated from a cell or tissue of interest. Such samples also include DNA 
amplified from the cDNA, and RNA transcribed from the amplified DNA. One of skill in the 
art would appreciate that it is desirable to inhibit or destroy RNase present in homogenates 
before homogenates are used. 

[0075] Biological samples may be of any biological tissue or fluid or cells from any organism 
as well as cells raised in vitro, such as cell lines and tissue culture cells. Frequently the 
sample will be a tissue or cell sample that has been exposed to a compound, agent, drug, 
pharmaceutical composition, potential environmental pollutant or other composition. In 
some formats, the sample will be a "clinical sample" which is a sample derived from a 
patient. Typical clinical samples include, but are not limited to, sputum, blood, blood cells 
(e.g., white cells), tissue or fine needle biopsy samples, urine, peritoneal fluid, and pleural 
fluid, or cells therefrom. Biological samples may also include sections of tissues, such as 
frozen sections or formalin fixed sections taken for histological purposes. 

Hybridization 

[0076] Nucleic acid hybridization simply involves contacting a probe and target nucleic acid 
under conditions where the probe and its complementary target can form stable hybrid 
duplexes through complementary base pairing. See WO 99/32660. The nucleic acids that do 
not form hybrid duplexes are then washed away leaving the hybridized nucleic acids to be 
detected, typically through detection of an attached detectable label. It is generally 
recognized that nucleic acids are denatured by increasing the temperature or decreasing the 
salt concentration of the buffer containing the nucleic acids. Under low stringency conditions 
(e.g., low temperature and/or high salt) hybrid duplexes (e.g., DNA:DNA, RNA:RNA, or 
RNA:DNA) will form even where the annealed sequences are not perfectly complementary. 
Thus, specificity of hybridization is reduced at lower stringency. Conversely, at higher 
stringency (e.g., higher temperature or lower salt) successful hybridization tolerates fewer 
mismatches. One of skill in the art will appreciate that hybridization conditions may be 
selected to provide any degree of stringency. 

[0077] In a preferred embodiment, hybridization is performed at low stringency, in this case 
in 6x SSPET at 37°C (0.005% Triton X-100), to ensure hybridization and then subsequent 
washes are performed at higher stringency (e.g., 1x SSPET at 37°C) to eliminate mismatched 
hybrid duplexes. Successive washes may be performed at increasingly higher stringency 
(e.g., down to as low as 0.25x SSPET at 37°C to 50°C) until a desired level of hybridization 
specificity is obtained. Stringency can also be increased by addition of agents such as 
formamide. Hybridization specificity may be evaluated by comparison of hybridization to 
the test probes with hybridization to the various controls that can be present (e.g., expression 
level control, normalization control, mismatch controls, etc.). 

[0078] In general, there is a tradeoff between hybridization specificity (stringency) and signal 
intensity. Thus, in a preferred embodiment, the wash is performed at the highest stringency 
that produces consistent results and that provides a signal intensity greater than the 
background intensity. Accordingly, in a preferred embodiment, the hybridized array may be washed 
in solutions of successively higher stringency and read between each wash. Analysis of the data 
sets thus produced will reveal a wash stringency above which the hybridization pattern is not 
appreciably altered and which provides adequate signal for the particular oligonucleotide 
probes of interest. 

Kits 

[0079] The invention further includes kits combining, in different combinations, high-density 
oligonucleotide arrays, reagents for use with the arrays, signal detection and array-processing 
instruments, toxicology databases and analysis and database management software described 
above. The kits may be used, for example, to predict or model the toxic response of a test 
compound. 

[0080] The databases that may be packaged with the kits are described above. In particular, 
the database software and packaged information may contain the databases saved to a 
computer-readable medium, or transferred to a user's local server. In another format, 
database and software information may be provided in a remote electronic format, such as a 
website, the address of which may be packaged in the kit. 

[0081] Databases and software designed for use with microarrays are discussed in Balaban et 
al., U.S. Patent Nos. 6,229,911, a computer-implemented method for managing information 
collected from small or large numbers of microarrays, and 6,185,561, a computer-based 
method with data mining capability for collecting gene expression level data, adding 
additional attributes and reformatting the data to produce answers to various queries. Chee et 
al., U.S. Patent No. 5,974,164, disclose a software-based method for identifying mutations in 
a nucleic acid sequence based on differences in probe fluorescence intensities between wild 
type and mutant sequences that hybridize to reference sequences. 

[0082] Without further description, it is believed that one of ordinary skill in the art can, 
using the preceding description and the following illustrative examples, make and utilize the 
compounds of the present invention and practice the claimed methods. The following 
working examples, therefore, specifically point out the preferred embodiments of the present 
invention, and are not to be construed as limiting in any way the remainder of the disclosure. 

EXAMPLES 

Example 1: Generation of Toxicity Models using RMA and PLS 
[0083] Various kidney toxins are administered to male Sprague-Dawley rats at various 
timepoints using administration diluents, protocols and dosing regimes as previously 
described in the art and previously described in the priority application discussed above. 
As an illustration of the protocols used, the toxins are administered, and the animals are 
sacrificed and kidney samples harvested, at the time points indicated below. 

OBSERVATION OF ANIMALS 

[0084] 1. Clinical cage side observations- twice daily mortality and moribundity check. 
Skin and fur, eyes and mucous membrane, respiratory system, circulatory system, autonomic 
and central nervous system, somatomotor pattern, and behavior pattern are checked. 
Potential signs of toxicity, including tremors, convulsions, salivation, diarrhea, lethargy, 
coma or other atypical behavior or appearance, are recorded as they occur and include a time 
of onset, degree, and duration. 

[0085] 2. Physical Examinations-Prior to randomization, prior to initial treatment, and prior 
to sacrifice. 

[0086] 3. Body Weights-Prior to randomization, prior to initial treatment, and prior to 
sacrifice. 

CLINICAL PATHOLOGY 

[0087] 1. Frequency- Prior to necropsy. 

[0088] 2. Number of animals- All surviving animals. 

[0089] 3. Bleeding Procedure-Blood was obtained by puncture of the orbital sinus while 
under 70% CO2 / 30% O2 anesthesia. 

[0090] 4. Collection of Blood Samples-Approximately 0.5 mL of blood is collected into 
EDTA tubes for evaluation of hematology parameters. Approximately 1 mL of blood is 
collected into serum separator tubes for clinical chemistry analysis. Approximately 200 µL 
of plasma is obtained and frozen at -80°C for test compound/metabolite estimation. An 
additional ~2 mL of blood is collected into a 15 mL conical polypropylene vial to which ~3 
mL of Trizol is immediately added. The contents are immediately mixed with a vortex and 
by repeated inversion. The tubes are frozen in liquid nitrogen and stored at ~ -80°C. 

TERMINATION PROCEDURES 
Terminal Sacrifice 

[0091] At the time points indicated above, rats are weighed, physically examined, sacrificed 
by decapitation, and exsanguinated. The animals are necropsied within approximately five 
minutes of sacrifice. Separate sterile, disposable instruments are used for each animal. 
Necropsies are conducted on each animal following procedures approved by board-certified 
pathologists. 

[0092] Animals not surviving until terminal sacrifice are discarded without necropsy 
(following euthanasia by carbon dioxide asphyxiation, if moribund). The approximate time 
of death for moribund or found dead animals is recorded. 

Postmortem Procedures 

[0093] All tissues are collected and frozen within approximately 5 minutes of the animal's 
death. Tissues are stored at approximately -80°C or preserved in 10% neutral buffered 
formalin. 

Tissue Collection and Processing 
[0094] Liver 

1. Right medial lobe - Snap freeze in liquid nitrogen and store at -80°C. 

2. Left medial lobe - Preserve in 10% neutral-buffered formalin (NBF) and evaluate for gross 
and microscopic pathology. 

3. Left lateral lobe - Snap freeze in liquid nitrogen and store at -80°C. 
[0095] Heart 

1. A sagittal cross-section containing portions of the two atria and of the two ventricles is 
preserved in 10% NBF. The remaining heart is frozen in liquid nitrogen and stored at 
~ -80°C. 

[0096] Kidneys (both) 

1. Left - Hemi-dissect; half is preserved in 10% NBF and the remaining half is frozen in 
liquid nitrogen and stored at ~ -80°C. 

2. Right - Hemi-dissect; half is preserved in 10% NBF and the remaining half is frozen in 
liquid nitrogen and stored at ~ -80°C. 

[0097] Testes (both)-A sagittal cross-section of each testis is preserved in 10% NBF. The 
remaining testes are frozen together in liquid nitrogen and stored at -80°C. 
[0098] Brain (whole)-A cross-section of the cerebral hemispheres and of the diencephalon 
are preserved in 10% NBF, and the rest of the brain is frozen in liquid nitrogen and stored at 
~ -80°C. 

[0099] Microarray sample preparation is conducted with minor modifications, following the 
protocols set forth in the Affymetrix GeneChip® Expression Technical Analysis Manual 
(Affymetrix, Inc. Santa Clara, CA). Frozen tissue is ground to a powder using a Spex 
Certiprep 6800 Freezer Mill. Total RNA is extracted with Trizol (Invitrogen, Carlsbad CA) 
utilizing the manufacturer's protocol. mRNA is isolated using the Oligotex mRNA Midi kit 
(Qiagen) followed by ethanol precipitation. Double stranded cDNA is generated from 
mRNA using the Superscript Choice system (Invitrogen, Carlsbad CA). First strand cDNA 
synthesis is primed with a T7-(dT24) oligonucleotide. The cDNA is phenol-chloroform 
extracted and ethanol precipitated to a final concentration of 1 µg/ml. From 2 µg of cDNA, 
cRNA is synthesized using Ambion's T7 MegaScript in vitro Transcription Kit. 
[00100] To biotin label the cRNA, nucleotides Bio-11-CTP and Bio-16-UTP (Enzo 
Diagnostics) are added to the reaction. Following a 37°C incubation for six hours, impurities 
are removed from the labeled cRNA following the RNeasy Mini kit protocol (Qiagen). 
cRNA is fragmented (fragmentation buffer consisting of 200 mM Tris-acetate, pH 8.1, 500 
mM KOAc, 150 mM MgOAc) for thirty-five minutes at 94°C. Following the Affymetrix 
protocol, 55 µg of fragmented cRNA is hybridized on the Affymetrix rat array set for 
twenty-four hours at 60 rpm in a 45°C hybridization oven. The chips are washed and stained with 
Streptavidin Phycoerythrin (SAPE) (Molecular Probes) in Affymetrix fluidics stations. To 
amplify staining, SAPE solution is added twice with an anti-streptavidin biotinylated 
antibody (Vector Laboratories) staining step in between. Hybridization to the probe arrays is 
detected by fluorometric scanning (Hewlett Packard Gene Array Scanner). Data is analyzed 
using Affymetrix GeneChip® and Expression Data Mining (EDMT) software, the 
GeneExpress® database, and S-Plus® statistical analysis software (Insightful Corp.). 

Identification of Toxicity Markers and Model Building using RMA and PLS Algorithms 
[00101] RMA/PLS models are built as follows. From DNA microarray data from one or 
more studies, a matrix of RMA fold-change expression values is generated. These values are 
generated, for example, according to the method of Irizarry et al. (Nucl Acids Res 31(4):e15, 
2003), which uses the following equation to produce a log scale linear additive model: 
$T(PM_{ij}) = e_i + a_j + \varepsilon_{ij}$. $T$ represents the transformation that corrects for background and 
normalizes and converts the PM (perfect match) intensities to a log scale, $e_i$ represents the 
log2 scale expression values found on arrays $i = 1, \ldots, I$, $a_j$ represents the log scale affinity 
effects for probes $j = 1, \ldots, J$, and $\varepsilon_{ij}$ represents error (to correct for the differences in variances 
when using probes that bind with different intensities). 
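
To make the additive model concrete, the sketch below fits an array effect and a probe affinity effect to one probe set by Tukey's median polish, the robust procedure that the RMA method of Irizarry et al. uses to estimate these parameters. It is a minimal sketch and assumes the transformation T (background correction, normalization and log2 conversion) has already been applied to the input. The function name, iteration limit and tolerance are hypothetical.

```python
from statistics import median


def median_polish(y, max_iter=10, tol=1e-6):
    """Fit y[i][j] ~ overall + row[i] + col[j] by median polish.

    y: rows are arrays i, columns are probes j, values are transformed
       log2 PM intensities for a single probe set.
    Returns (expression per array, affinity per probe, residuals).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(r) for r in y]
    overall = 0.0
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        rmed = [median(r) for r in resid]
        for i in range(n_rows):
            row[i] += rmed[i]
            resid[i] = [v - rmed[i] for v in resid[i]]
        shift = median(col)
        overall += shift
        col = [c - shift for c in col]
        # Sweep column medians out of the residuals into the column effects.
        cmed = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += cmed[j]
            for i in range(n_rows):
                resid[i][j] -= cmed[j]
        shift = median(row)
        overall += shift
        row = [r - shift for r in row]
        if max(abs(m) for m in rmed + cmed) < tol:
            break
    expression = [overall + r for r in row]  # estimate of e_i for each array
    return expression, col, resid


# Hypothetical probe set: 3 arrays x 3 probes, with one outlying probe value.
values = [[7.0, 8.0, 20.0], [8.0, 9.0, 10.0], [9.0, 10.0, 11.0]]
expression, affinity, residuals = median_polish(values)
print(expression)
```

Because medians are used in place of means, the single outlying value in the example ends up in the residuals and does not shift the expression estimate for its array.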

[00102] In RMA fold-change matrices, the rows represent individual fragments, and the 
columns are individual samples. A vehicle cohort median matrix is then calculated, in which 
the rows represent fragments and the columns represent vehicle cohorts, one cohort for each 
study/time-point combination. The values in this matrix are the median RMA expression 
values across the samples within those cohorts. Next, a matrix of normalized RMA 
expression values is generated, in which the rows represent individual fragments and the 
columns are individual samples. The normalized RMA values are the RMA values minus the 
value from the vehicle cohort median matrix corresponding to the time-matched vehicle 
cohort. PLS modeling is then applied to the normalized RMA matrix (restricted to a subset of 
fragments selected as described below), using a -1 = non-tox, +1 = tox supervised score vector 
as the dependent variable and the rows of the normalized RMA matrix as the independent 
variables. PLS works by computing a series of PLS components, where each component is a 
weighted linear combination of fragment values. We use the nonlinear iterative partial least 
squares method to compute the PLS components. 
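
The normalization and PLS steps of this paragraph may be sketched as follows. This is a minimal Python illustration under stated assumptions, not a definitive implementation: the function names `normalize_to_vehicle` and `pls1_nipals`, the `group` labels (one label per study/time-point combination) and the toy values are hypothetical. The PLS routine is the standard single-response (PLS1) form of NIPALS and expects samples in rows, i.e. the transpose of the fragments-by-samples matrix.

```python
import numpy as np

def normalize_to_vehicle(rma, group, is_vehicle):
    """Subtract the time-matched vehicle cohort median from every sample.

    rma: fragments x samples matrix; group: study/time-point label per sample;
    is_vehicle: True for vehicle samples. Each group needs >= 1 vehicle sample.
    """
    rma = np.asarray(rma, dtype=float)
    group = np.asarray(group)
    is_vehicle = np.asarray(is_vehicle, dtype=bool)
    out = np.empty_like(rma)
    for g in np.unique(group):
        in_group = group == g
        vehicle_median = np.median(rma[:, in_group & is_vehicle],
                                   axis=1, keepdims=True)
        out[:, in_group] = rma[:, in_group] - vehicle_median
    return out

def pls1_nipals(X, y, n_components):
    """PLS1 by NIPALS. X: samples x fragments; y: -1 (non-tox) / +1 (tox).

    Returns (coef, x_mean, y_mean); predict with (X - x_mean) @ coef + y_mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                      # weight vector for this component
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left to explain
            break
        w = w / norm
        t = E @ w                        # component scores
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        c = f @ t / tt                   # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # per-fragment PLS weights
    return coef, x_mean, y_mean

# Toy data: 2 fragments x 4 samples in one study/time-point group, 2 vehicles.
rma = np.array([[1.0, 2.0, 3.0, 10.0],
                [5.0, 5.0, 6.0, 4.0]])
normalized = normalize_to_vehicle(rma, group=[0, 0, 0, 0],
                                  is_vehicle=[True, True, False, False])
```

In this sketch the entries of `coef` play the role of the per-fragment PLS weights, so a sample score is a weighted sum of its normalized fragment values.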

[00103] To select fragments, a vehicle cohort mean matrix is generated, in which the rows 
represent fragments and the columns represent vehicle cohorts, one cohort for each 
study/time-point combination. The values in this matrix are the mean RMA expression 
values across the samples within those cohorts. A treated cohort mean matrix is then 
generated, in which the rows represent fragments and the columns represent treated (non- 
vehicle) cohorts, one cohort for each study/time-point/compound/dose combination. The 
values in this matrix are the mean RMA expression values across the samples within those 
cohorts. Next, a treated cohort fold-change matrix is generated, in which the rows represent 
fragments and the columns represent treated cohorts, one cohort for each 
study/time-point/compound/dose combination. The values in this matrix are the values in the treated 
cohort mean matrix minus the values in the vehicle cohort mean matrix corresponding to 
appropriate time-matched vehicle cohorts. Subsequently, a treated cohort p-value matrix is 
generated, in which the rows represent fragments and the columns represent treated cohorts, 
one cohort for each study/time-point/compound/dose combination. The values in this matrix 
are p-values based on two-sample t-tests comparing the treated cohort mean values to the 
vehicle cohort mean values corresponding to appropriate time-matched vehicle cohorts. This 
matrix is converted to a binary coding based on the p-values being less than 0.05 (coded as 1) 
or greater than 0.05 (coded as 0). 

[00104] The row sums of the binary treated cohort p-value matrix are computed, where that 
row sum represents a "gene regulation score" for each fragment, representing the total 
number of treated cohorts where the fragment showed differential regulation (up- or down-regulation) 
compared to its time-matched vehicle cohort. PLS modeling and 2/3:1/3 cross-validation 
are then performed based on taking the top N fragments according to the regulation 
score, varying N and the number of PLS components, and recording the model success rate 
for each combination. N is chosen to be the point at which the cross-validated error rate is 
minimized. In the PLS model, each of those N fragments receives a PLS weight (PLS score) 
corresponding to the fragment's utility, or predictive ability, in the model (see Table 2 for an 
exemplary list of PLS scores for a kidney general toxicity model). 
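
As a non-limiting illustration of the fragment ranking step, the following Python sketch starts from a treated cohort p-value matrix that is assumed to have been computed already (fragments by treated cohorts). The function names `regulation_scores` and `top_n_fragments` and the toy p-values are hypothetical; ties are broken by original row order, which is an arbitrary choice made for this sketch.

```python
import numpy as np

def regulation_scores(p_values, alpha=0.05):
    """Row sums of the binary-coded p-value matrix (1 where p < alpha)."""
    binary = (np.asarray(p_values, dtype=float) < alpha).astype(int)
    return binary.sum(axis=1)

def top_n_fragments(scores, n):
    """Row indices of the n fragments with the highest regulation score."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:n]

# Toy p-value matrix: 3 fragments x 3 treated cohorts.
p_matrix = np.array([[0.010, 0.200, 0.030],
                     [0.500, 0.600, 0.040],
                     [0.001, 0.002, 0.003]])
scores = regulation_scores(p_matrix)
selected = top_n_fragments(scores, n=2)
```

In practice the value of N passed to `top_n_fragments` would be varied together with the number of PLS components, keeping the combination with the lowest cross-validated error rate.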

Example 2: Methods of predicting at least one toxic effect of a test agent 
[00105] To determine whether or not a sample from an animal treated with a test agent or 
compound exhibits at least one toxic effect or response, RNA is prepared from a cell or tissue 
sample exposed to the agent and hybridized to a DNA microarray, as described in Example 1 
above. From the nucleic acid hybridization data, a prediction score is calculated for that 
sample and compared to a reference score from a toxicity reference database according to the 
following equation: sample prediction score = \sum_i w_i RFC_i, where "i" is the index number for 
each gene in a gene expression profile to be evaluated, "w_i" is the PLS weight (or PLS score, 
see Table 2 for an exemplary list of PLS scores for a general kidney toxicity model) for each 
gene, and "RFC_i" is the RMA fold-change value for the i-th gene, as determined from a normalized 
RMA matrix of gene expression data from the sample (described above). The PLS weight 
multiplied by the RMA fold-change value gives a gene regulation score for each gene, and 
the regulation scores for all the individual genes are added to give a prediction score for the 
sample. 
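
The prediction score is a weighted sum and may be computed, for example, as in the following Python sketch. The function name `prediction_score` and the three toy weights and fold-change values are hypothetical and are not taken from Table 2.

```python
import numpy as np

def prediction_score(pls_weights, rma_fold_changes):
    """Sum over genes of PLS weight times RMA fold-change value."""
    w = np.asarray(pls_weights, dtype=float)
    r = np.asarray(rma_fold_changes, dtype=float)
    if w.shape != r.shape:
        raise ValueError("weights and fold-change values must align gene by gene")
    return float(np.dot(w, r))

# Toy example with three genes (values are illustrative only).
toy_weights = [0.5, -0.2, 0.1]
toy_fold_changes = [2.0, 1.0, 3.0]
toy_score = prediction_score(toy_weights, toy_fold_changes)
```

The resulting score would then be compared to the cut-off value established for the model in question.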

[00106] As a quality control (QC) check, for each incoming study, an average correlation 
assessment is performed. After the RMA matrix is generated (genes by samples), a Pearson 
correlation matrix of the samples to each other is calculated. This matrix is samples by 
samples. For each sample row of the matrix, the mean of all correlation values in that row of 
the matrix, excluding the diagonal (which is always 1), is calculated. This mean is the 
average correlation for that sample. If the average correlation is less than a threshold (for 
instance 0.90), the sample is flagged as a potential outlier. This process is repeated for each 
row (sample) in the study. Outliers flagged by the average correlation QC check are dropped 
out of any downstream normalization, prediction or compound similarity steps in the process. 
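
The average correlation QC check may be sketched in Python as follows. The function name `average_correlation_qc` and the toy matrix are hypothetical. Because the toy study has only five samples, a lower threshold than the exemplary 0.90 is used in the example call.

```python
import numpy as np

def average_correlation_qc(rma, threshold=0.90):
    """Flag samples whose mean correlation with the other samples is low.

    rma: genes x samples matrix. Returns (average_correlation, flagged).
    """
    rma = np.asarray(rma, dtype=float)
    corr = np.corrcoef(rma, rowvar=False)        # samples x samples
    n_samples = corr.shape[0]
    # Row mean excluding the diagonal entry of each row.
    average = (corr.sum(axis=1) - np.diag(corr)) / (n_samples - 1)
    return average, average < threshold

# Toy study: four mutually correlated samples and one unrelated sample.
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
odd = np.array([1.0, 0.0, -2.0, 0.0, 1.0])       # zero correlation with base
toy_rma = np.column_stack([base, 2 * base + 1, base + 3, 0.5 * base, odd])
avg_corr, flagged = average_correlation_qc(toy_rma, threshold=0.5)
```

Samples for which `flagged` is true would be excluded from the downstream normalization, prediction and compound similarity steps.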
[00107] To establish a toxicity prediction score cut-off value for a toxicity model, the true-positive 
and false-positive rates for each possible score cut-off value are computed, using the 
scores from all tox and non-tox samples in the training set. This generates an ROC curve, 
which we use to set the cut-off score at the point on the ROC curve corresponding to ~5% 
false-positive rate. For example, in a kidney toxicity model of Table 2, a cut-off prediction 
score is about 0.318. If the sample score is about 0.318 or above, it can be predicted that the 
sample shows a toxic response after exposure to the test compound. If the sample score is 
below 0.318, it can be predicted that the sample does not show a toxic response. 
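
One possible way of selecting such a cut-off is sketched below in Python. The sketch assumes that a sample is called toxic when its score is at or above the cut-off and that the chosen point is the lowest cut-off whose false-positive rate does not exceed the target; this selection rule, the function name `cutoff_at_false_positive_rate` and the toy scores are assumptions made for illustration.

```python
import numpy as np

def cutoff_at_false_positive_rate(scores, is_tox, target_fpr=0.05):
    """Return (cutoff, true_positive_rate, false_positive_rate).

    Every observed score is tried as a cut-off, from lowest to highest, and the
    first one whose false-positive rate is <= target_fpr is returned.
    """
    scores = np.asarray(scores, dtype=float)
    is_tox = np.asarray(is_tox, dtype=bool)
    tox, non_tox = scores[is_tox], scores[~is_tox]
    for cutoff in np.unique(scores):             # unique() sorts ascending
        fpr = np.mean(non_tox >= cutoff)
        if fpr <= target_fpr:
            return float(cutoff), float(np.mean(tox >= cutoff)), float(fpr)
    # No observed score qualifies: nothing is called toxic.
    return float("inf"), 0.0, 0.0

# Toy training set: 20 non-tox scores from -1.0 to 0.9 and 4 tox scores.
non_tox_scores = np.arange(-10, 10) / 10.0
tox_scores = np.array([0.5, 0.85, 1.0, 1.2])
all_scores = np.concatenate([non_tox_scores, tox_scores])
labels = np.concatenate([np.zeros(20, dtype=bool), np.ones(4, dtype=bool)])
cutoff, tpr, fpr = cutoff_at_false_positive_rate(all_scores, labels)
```

With these toy scores, one of the twenty non-tox samples scores at or above the selected cut-off, which corresponds to a 5% false-positive rate.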
[00108] The model can be trained by setting a score of -1 for each gene that cannot predict a 
toxic response and by setting a score of +1 for each gene that can predict a toxic response. 
Cross-validation of RMA/PLS models may be performed by the compound-drop method and 
by the 2/3:1/3 method. In the compound-drop method, sample data from animals treated with 
one particular test compound are removed from a model, and the ability of this model to 
predict toxicity is compared to that of a model containing a full data set. In the 2/3:1/3 
method, gene expression information from a random third of the genes in the model is 
removed, and the ability of this subset model to predict toxicity is compared to that of a 
model containing a full data set. 
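
The compound-drop method can be organized as a loop over compounds, as in the following Python sketch. Only the generation of the training and held-out sample indices is shown; model fitting and scoring are omitted. The function name `compound_drop_splits` and the compound labels are hypothetical.

```python
import numpy as np

def compound_drop_splits(compounds):
    """Yield (compound, training_indices, held_out_indices), one per compound.

    compounds: the compound label of each sample in the training set.
    """
    compounds = np.asarray(compounds)
    for name in np.unique(compounds):
        held_out = np.flatnonzero(compounds == name)
        training = np.flatnonzero(compounds != name)
        yield str(name), training, held_out

# Toy example: four samples from three compounds (labels are illustrative).
toy_compounds = ["compound_a", "compound_b", "compound_a", "compound_c"]
splits = list(compound_drop_splits(toy_compounds))
```

For each split, a model would be built on the training indices and its predictions compared with those of the model built on the full data set.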

[00109] Compound similarity is assessed in the following way. In the same manner as 
described above, a cohort fold-change vector for each study/time-point/compound/dose 
combination is calculated. This vector is reduced to only the fragments used in the PLS 
predictive models. We then calculate Pearson correlations for that cohort fold-change vector 
with each cohort vector (also reduced to only the fragments used in the PLS predictive 
models) in our reference database. Finally, these Pearson correlations are ranked from 
highest to lowest and the results are reported. 
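
The compound similarity ranking may be sketched in Python as follows. The function name `rank_similar_cohorts`, the cohort names and the toy vectors are hypothetical, and the vectors are assumed to be already reduced to the fragments used in the PLS predictive models. A vector with constant values has no defined Pearson correlation and is not handled here.

```python
import numpy as np

def rank_similar_cohorts(query, reference, names):
    """Rank reference cohorts by Pearson correlation with a query cohort.

    query: fold-change vector (fragments,); reference: cohorts x fragments;
    names: one label per reference cohort. Returns [(name, r), ...] sorted
    from highest to lowest correlation.
    """
    q = np.asarray(query, dtype=float)
    ref = np.asarray(reference, dtype=float)
    q = q - q.mean()
    ref = ref - ref.mean(axis=1, keepdims=True)
    r = (ref @ q) / (np.linalg.norm(ref, axis=1) * np.linalg.norm(q))
    order = np.argsort(-r, kind="stable")
    return [(names[i], float(r[i])) for i in order]

# Toy reference database of three cohorts over four fragments.
toy_query = [1.0, 2.0, 3.0, 4.0]
toy_reference = [[2.0, 4.0, 6.0, 8.0],     # same direction as the query
                 [4.0, 3.0, 2.0, 1.0],     # opposite direction
                 [1.0, -1.0, -1.0, 1.0]]   # unrelated
ranking = rank_similar_cohorts(toy_query, toy_reference,
                               ["cohort_1", "cohort_2", "cohort_3"])
```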

[00110] A report may be generated comprising information or data related to the results of 
the methods of predicting at least one toxic effect. The report may comprise information 
related to the toxic effects predicted by the comparison of at least one sample prediction score 
to at least one toxicity reference prediction score from the database. The report may also 
present information concerning the nucleic acid hybridization data, such as the integrity of 
the data as well as information inputted by the user of the database and methods of the 
invention, such as information used to annotate the nucleic acid hybridization data. See PCT 
US02/22701 for a non-limiting example of a toxicity report that may be generated. 

Example 3: Converting RMA data from one platform to another 

[00111] An algorithm was developed to convert probe intensity data from a first type of 
microarray to RMA data of a second type of microarray. This is beneficial to the customer 
because it provides the customer with the freedom to select the type of microarray it wishes 
to use with an RMA/PLS predictive model. Frequently this is the newest microarray on the 
market. The algorithm is beneficial for the company which builds RMA/PLS statistical 
models on microarray data because money and resources do not have to be expended to 
rebuild statistical models built on discontinued microarrays. 

[00112] The conversion algorithm developed can be used to convert data from the Affymetrix 
GeneChip® rat RAE2.0 microarray to Affymetrix GeneChip® rat RGU34A microarray 
data. This conversion allows RMA/PLS toxicogenomics models built on the 
Affymetrix RGU34A microarray platform to predict customer data generated on the RAE2.0 
microarray platform. The conversion algorithm was tested using the liver toxicity model 
described in U.S. Provisional Application Serial No. 60/559,949, which is herein incorporated by 
reference. 

[00113] The first step to using a conversion algorithm is to map microarray fragments. The 
RGU34A microarray fragments which comprise the liver toxicity model were mapped to the 
RAE2.0 microarray. The liver toxicity model is based on 1,100 Affymetrix GeneChip® 
RGU34A microarray fragments. Of the 1,100 fragments in the model, 907 were suggested 
by Affymetrix as matching to fragments on the RAE2.0 microarray. See Affymetrix's 
"User's Guide to Product Comparison Spreadsheets" which is herein incorporated by 
reference. Another 105 fragments mapped to fragments sharing the same RefSeq ID and 55 
mapped to fragments which mapped to the same UniGene cluster. The 1,067 mapping 
fragments were reduced to 1,053. The 1,053 mapped fragments represented 16 RGU34A and 
11 RAE2.0 probes. The 47 fragments which were not mapped to the RAE2.0 microarray 
were assigned an RMA fold-change value of 0 for all samples and did not contribute to the 
prediction. 
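
The handling of unmapped fragments may be illustrated with the following Python sketch, in which model fragments without a RAE2.0 counterpart receive a fold-change value of 0 and therefore add nothing to the weighted sum. The function name `fold_changes_in_model_order` and all fragment identifiers are hypothetical placeholders.

```python
import numpy as np

def fold_changes_in_model_order(model_fragments, mapping, converted_values):
    """Arrange converted fold-change values in the order the model expects.

    model_fragments: RGU34A fragment ids used by the model, in model order.
    mapping: dict from RGU34A fragment id to RAE2.0 fragment id.
    converted_values: dict from RAE2.0 fragment id to converted fold-change.
    Fragments with no mapping (or no value) are left at 0.0.
    """
    values = np.zeros(len(model_fragments))
    for k, fragment in enumerate(model_fragments):
        target = mapping.get(fragment)
        if target is not None and target in converted_values:
            values[k] = converted_values[target]
    return values

# Toy example: three model fragments, one of which has no RAE2.0 match.
toy_model_fragments = ["rgu_frag_1", "rgu_frag_2", "rgu_frag_3"]
toy_mapping = {"rgu_frag_1": "rae_frag_9", "rgu_frag_3": "rae_frag_4"}
toy_converted = {"rae_frag_9": 1.5, "rae_frag_4": -0.7}
toy_values = fold_changes_in_model_order(toy_model_fragments, toy_mapping,
                                         toy_converted)
```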

[00114] Once the microarray fragments are mapped, training samples are selected to 
calculate the conversion model weights. The inventors searched Gene Logic's ToxExpress® 
reference database, a database which is built on the Affymetrix RGU34A platform, for 
samples that covered a large amount of interquartile range with respect to signal intensity. 
Samples that covered the largest amount of variable space were selected because this method 
of sample selection had previously been determined by the inventors to be reliable in the 
development of a human sample conversion algorithm. The samples maximized 
Max(X_{ij}) - Min(X_{ij}), where i indexes genes and j indexes samples. 

[00115] The inventors found that sample size calculations were stable at a sampling of 
approximately 100 microarrays. For this reason, a training set consisting of 100 compounds 
and vehicles from rat liver tissue was selected. 

[00116] The 100 training samples were used to train the weights in the conversion algorithm. 
This step is important because it provides for the quantitative aspect of the conversion. The 
weight training was performed based on a multiple regression analysis with probe values as 
the independent variables and RMA expression as the sum of the dependent variables. 
[00117] Test samples were evaluated using the trained conversion algorithm. The multiple 
regression model was built on the 11 perfect match probe intensities and generated a 
predicted RGU34 expression value from a weighted sum of RAE2.0 probe values. Each test 
array was scaled to an average probe intensity of 10 (log scale). The conversion algorithm 
used is given as: 

Y_i^{RGU34} = \beta_0 + \sum_{j=1}^{11} \beta_j \log(X_{ij}^{RAE2.0} / S) 

where Y_i^{RGU34} is the RGU34 RMA expression value for a fragment; X_{ij}^{RAE2.0}, for i = 1...1053 and 
j = 1...11, are perfect match probe intensity values for the marker genes on the RAE2.0 
microarray; and S is a chip scale factor, S = \sum_{ij} X_{ij}^{RAE2.0} / n. Probe intensities were first floored to the 
minimum intensity value of 30. 
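
A minimal Python sketch of the conversion for a single fragment follows, assuming the weighted-sum form described above (an intercept plus one weight per perfect match probe). Several details are assumptions made for illustration: the logarithm is taken as base 2, n is taken as the number of probe values on the array, the weights are fitted by ordinary least squares, and all function names are hypothetical.

```python
import numpy as np

FLOOR = 30.0  # minimum probe intensity stated in the text

def log_scaled_probes(pm):
    """Floor PM intensities, divide by the chip scale factor S, take log2.

    pm: fragments x probes matrix of PM intensities from one RAE2.0 array.
    S is computed as the mean floored intensity (sum of all values / n).
    """
    pm = np.maximum(np.asarray(pm, dtype=float), FLOOR)
    scale = pm.sum() / pm.size
    return np.log2(pm / scale)

def fit_fragment_weights(z, y):
    """Least-squares weights for one fragment.

    z: training samples x probes matrix of log-scaled RAE2.0 probe values.
    y: RGU34 RMA expression value of the fragment in each training sample.
    Returns beta, with beta[0] the intercept and beta[1:] the probe weights.
    """
    z = np.asarray(z, dtype=float)
    design = np.column_stack([np.ones(z.shape[0]), z])
    beta, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return beta

def convert_fragment(beta, z_row):
    """Predicted RGU34 RMA value from one sample's log-scaled probe values."""
    return float(beta[0] + np.asarray(z_row, dtype=float) @ beta[1:])

# Toy array with 2 fragments x 2 probes (real fragments would have 11 probes).
toy_pm = np.array([[10.0, 60.0],
                   [120.0, 30.0]])
toy_log = log_scaled_probes(toy_pm)
```

In this sketch one weight vector would be fitted per mapped fragment from the training arrays and then applied to each incoming test array after the same flooring and scaling.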

[00118] Alternative approaches to using a multiple regression model exist to convert 
RAE2.0 data to RGU34 RMA data. Non-linear regression on probe values as well as 
canonical correlation of RAE2.0 probes to RGU34A probes could be used. RMA values on 
a RAE2.0 microarray could be computed and then scaled or quantile-normalized to RGU34A 
RMA values. In addition, although the multiple regression analysis used in this example does 
not take into account mismatch (MM) probes, an analysis could be used which takes into account 
mismatch probes. 

[00119] The liver predictive model was used to compare the predictive results of test 
data from the RGU34 microarray to test data derived from converted RAE2.0 array data. The 
consistency between the RGU34 array results and the converted RAE2.0 array results was 
high: the two data sets gave identical counts for 11 of the 13 treatments tested. Table 3 
provides the number of test samples per compound which were predicted 
as toxic out of the total number of samples for that compound using RGU34 RMA data and 
RAE2.0 converted RMA data. Amitriptyline, estradiol, amiodarone, diflunisal, 
phenobarbital, dioxin, ethionine, and LPS were selected as test toxicants. Clofibrate was 
selected because it is a rat-specific toxicant. Metformin, rosiglitazone, chlorpheniramine, and 
streptomycin were selected as test negative controls. The rat-specific toxicant and all of the 
tested negative controls were correctly predicted as showing no toxicity. 



Table 3 
(number of samples predicted as toxic / total number of samples per compound) 

Treatment           RGU34    RAE2.0 converted 
Amitriptyline       1/2      2/2 
Estradiol           3/3      3/3 
Amiodarone          2/3      2/3 
Diflunisal          2/3      2/3 
Phenobarbital       3/3      3/3 
Dioxin              3/3      2/3 
Ethionine           3/3      3/3 
LPS                 3/3      3/3 
Clofibrate          0/3      0/3 
Metformin           0/3      0/3 
Rosiglitazone       0/3      0/3 
Chlorpheniramine    0/3      0/3 
Streptomycin        0/3      0/3 

Example 4: Database 

[00120] A web-based software predictive modeling system called the ToxShield™ Suite was 
created which is composed of a collection of RMA/PLS toxicity predictive models. Liver 
RMA/PLS predictive models were built to allow a user to identify and classify various toxic 
and mechanistic responses to unknown or test compounds. The models represent a wide 
variety of endpoint pathologies and indications, including general toxicity, necrosis, steatosis, 
macrovesicular steatosis, microvesicular steatosis, cholestasis, hepatitis, carcinogenicity, 
genotoxic carcinogenicity, non-genotoxic carcinogenicity, rat specific non-genotoxic 
carcinogenicity, peroxisome proliferation, and inducer/liver enlargement. The outcome of 
toxicity models represents a detailed categorization of test or unknown compounds from 
which mechanistic information can be inferred. Although the current models available as 
part of this software system are related to liver toxicity, models relating to specific toxicities 
of other organs including, but not limited to, liver primary cell culture, kidney, heart, spleen, 
bone marrow, and brain could be used. 

[00121] The conversion algorithm described in Example 3 can be implemented in a software 
product such as the ToxShield™ Suite. The customer inputs his or her data that has been 
generated on a microarray such as the Affymetrix RAE2.0 GeneChip® microarray platform. 
The software utilizes the algorithm to convert the customer's gene expression data to RMA 
data which is compatible with the software's toxicogenomics model, which was built 
exclusively on a second microarray platform such as the Affymetrix RGU34A GeneChip® 
microarray. Visualizations and predictions can then be generated from the customer's data 
using the predictive model. 

[00122] Although the present invention has been described in detail with reference to 
examples above, it is understood that various modifications can be made without departing 
from the spirit of the invention. Accordingly, the invention is limited only by the following 
claims. All cited patents, patent applications and publications referred to in this application 
are herein incorporated by reference in their entirety. 